DOCUMENT RESUME

ED 379 293                                        TM 022 638

AUTHOR        Lang, William Steve; And Others
TITLE         Rasch Model Applications To Determine the Equivalence
              of a Readiness Test in Two Languages.
PUB DATE      Apr 93
NOTE          21p.; Paper presented at the International Objective
              Measurement Workshop (Atlanta, GA, April 10, 1993).
              Table A contains very small, filled print.
PUB TYPE      Reports - Evaluative/Feasibility (142) --
              Speeches/Conference Papers (150)

EDRS PRICE    MF01/PC01 Plus Postage.
DESCRIPTORS   Bias; Blacks; Computer Software; Cultural Awareness;
              *Cultural Differences; English; Goodness of Fit;
              Hispanic Americans; *Item Response Theory;
              Multilingual Materials; *Preschool Children;
              Preschool Education; School Readiness; *School
              Readiness Tests; Sex Differences; Spanish;
              Statistical Analysis; Test Items; *Test Validity;
              Whites
IDENTIFIERS   *Lollipop Test; *Rasch Model



ABSTRACT 

The Lollipop Test (La Prueba Lollipop) is a bilingual
preschool readiness test (in both English and Spanish) that has been
the subject of a number of studies to assess validity and detect
cultural bias. Such studies have not dealt with item analysis as a
way to measure cultural fairness. The Rasch model was used in a study
of the Lollipop Test which considered its cultural bias, the
usefulness of the Rasch model, and the utility of the new software,
IPARM. Subjects were 61 4- and 5-year-olds in Georgia and Florida (25
white, 24 black, and 12 Hispanic Americans). Results do not suggest
gender or cultural bias for the test as a whole. The potentially
biased items favor female or Hispanic students. The Rasch approach
was useful, and the software was easy to use, presenting results on
item functioning in easily understood format. Five tables present
study data. (Contains 10 references.) (SLD)

Rasch Model Applications to Determine the Equivalence
of a Readiness Test in Two Languages 

Bilingualism has been acknowledged to be a complex problem in
psychoeducational assessment for over half a century (Arsenian, 1937). Several
studies have commented on the serious drawbacks of using standardized,
commercially developed, English language assessment instruments for bilingual
students or as translations without intensive comparative research (Figueroa,
1983; Mardell-Czudnowski, 1987).

The Lollipop Test (La Prueba Lollipop) is a preschool readiness test in
both English and Spanish which has been the subject of a number of studies to
assess validity and detect cultural bias using correlation, regression and
discriminant analysis statistics (Chew & Lang, in press; Lang, Chew & Shomber,
1991; Chew & Lang, 1990). Unfortunately, these studies have focused on subtest
or total scores and have not dealt with item analysis as a way to measure cultural
fairness or bias. The primary problem with classical item analyses is sample
dependency. Unless the same person can take both language forms of a test, the
items cannot be easily compared for parallel functioning in classical item
statistics. Preschoolers are rarely proficient enough in one language, much less
two, for these comparisons to be made.

Rasch model statistics are useful here for two reasons. One is the sample 
independence of the item analyses. The other is the recent implementation
of between fit statistics in a usable computer application (Smith,
1991). Rather than treat this as a test equating exercise with common-item or
common-person calibration, the scores here are to be pooled in a single 
calibration while culture is treated as a demographic (like gender or race) in a 
bias detection approach. IPARM (Item and Person Analysis with the Rasch 
Model, 1992) offers new capabilities and graphic solutions to the task of quality 
control of person and item measures. For a fuller discussion of detecting item 
bias using the Rasch model, see Smith (1992). 
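
To make the model behind these statistics concrete: the dichotomous Rasch model expresses the probability of a correct response as a logistic function of the difference between person ability and item difficulty. The following is a minimal Python sketch under that standard definition. The function names and the example measures are illustrative assumptions and are not output from BIGSTEPS or IPARM.

```python
import math

def rasch_prob(ability: float, difficulty: float) -> float:
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def logit(p: float) -> float:
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))

# A person whose ability equals the item difficulty succeeds half the time.
p_equal = rasch_prob(0.0, 0.0)

# Hypothetical measures: two items (difficulties 0.5 and 1.5) seen by a
# low-ability person (-1.0) and by a high-ability person (2.0).
gap_low = logit(rasch_prob(-1.0, 0.5)) - logit(rasch_prob(-1.0, 1.5))
gap_high = logit(rasch_prob(2.0, 0.5)) - logit(rasch_prob(2.0, 1.5))

# The log-odds gap between the two items is the same for both persons,
# which is the sense in which item comparisons are sample independent.
print(p_equal, gap_low, gap_high)
```

Because the gap between the items does not depend on who takes them, item difficulties estimated from a pooled calibration can in principle be compared across subgroups, which is what the bias detection approach relies on.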

Of particular interest in this study was the use of between fit statistics for
the detection of bias (Smith, 1991). As a powerful statistic for identifying
measurement disturbance, between fit was most useful. Finding a qualifying
sample of Hispanic preschoolers is relatively difficult, and power with a small
number of subjects was considered more important than the possibility of Type I
error. Naturally, a signal that bias was present in test items, even by chance,
would lead to conservative interpretation of scores and judgmental examination
of the items instead of misplaced trust in the test results. In other words, Type I
errors would not lead to concluding a test was culturally fair, so Type II errors
were the more important ones to avoid in practice.

The research design was intended to answer several questions. First, does
the Spanish translation of the test perform the same as the English form on an
item by item comparison? Second, do the Rasch model results parallel the
previous classical studies? Third, are there any suggestions for using fit statistics and
Rasch model analyses for the particular use of cross-lingual test development
which follow from the experience? Fourth, what is the utility of the relatively new
software, IPARM?

METHOD 

Subjects 

The subjects were 61 four and five year old preschoolers from public 
preschools and kindergartens in Georgia and Florida. A total of 7 schools were 
part of the data collection. The sample consisted of 25 white, 24 black and 12 
Hispanic children. The original sample was also split approximately in two by
gender with 30 male and 31 female participants.

The data were collected in March and April of 1992 and 1993 by
examiners trained in the administration of the test. For Hispanic children, the
examiner spoke both English and Spanish. As is typical with bilingual children in
the United States, they often spoke Spanish at home and English at school. The
examiner informally asked the Hispanic children if Spanish was the language
spoken at home and answered any questions regarding the testing permission for
Spanish-speaking parents.

Instrument 

La Prueba Lollipop (Chew, 1989) is an individually administered, 
criterion-referenced screening measure of school readiness consisting of the 
following four subtests: (1) Identification of Colors and Shapes; (2) Picture 
Description, Position, and Spatial Recognition; (3) Identification of Numbers and 
Counting; and (4) Identification of Letters and Writing. The test items are 
individually administered orally with a total score range of 0 to 69. Preliminary 
investigation of the Spanish edition of La Prueba Lollipop (Lang, Chew, and 
Schomber, 1992) found no evidence of systematic bias for bilingual (Spanish) 
groups. That study compared three groups, notably alike in major demographic
characteristics, and found evidence that the Spanish and English versions of the
Lollipop Test performed similarly with students of comparable socioeconomic
status. Conclusions drawn from that effort suggested that the construction of the test and
the measurement of school readiness have not been confounded with culturally
loaded items, a problem often seen in test translations. The English version of
the test has been found to be relatively independent of socioeconomic variables
and requires approximately 15 to 20 minutes to administer and score (Chew and
Morris, 1987). The concurrent validity (Chew and Morris, 1987) and the
predictive validity (Chew and Lang, 1990) are well documented in the literature.

Procedure 

All children were identified by principals and school district administrators
as eligible for this study. The requirements were simply that the students fall
within the age-range of the instrument, and that there was no objection from a
parent or school official to testing the child. Each child was tested individually
according to the standardized directions.

Statistical Analysis 

Even though The Lollipop Test totals 69 possible points, not every
item is worth one point in scoring. Some items contribute two points and some
five points to the total. For purposes of analysis, the test items were entered as
single point, dichotomous data. The result was that 58 separate items were
included.

The data were first calibrated using the Rasch Model and the BIGSTEPS 
program. This created an initial item difficulty file with the associated fit 
statistics and item/person maps. The difficulty file was then used in a subsequent 
analysis using the IPARM software. In the second analysis, item subpopulation 
analyses were performed for ABILITY (three groups), SEX (male and female), 
and CULTURE (white, black, and Hispanic). For each of these breakdowns, 
between fit statistics, distracter analysis, and the predicted/observed proportion 
were produced. Person analyses were generated for each of the four subtests. 
For a complete discussion of BIGSTEPS and IPARM, see Smith (1991; 1992). 
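
The idea behind a between fit statistic can be sketched as follows: for one item, the observed number of correct responses in each subgroup is compared with the number the Rasch model predicts for that subgroup, and the squared discrepancies are scaled by the model variance. The Python sketch below implements one common between-group mean square of this kind. It is an illustration under stated assumptions, not IPARM's exact computation: the function names and the toy data are hypothetical, and the transformation to the standardized value (the one judged against 2.0) is omitted.

```python
import math
from typing import Dict, Sequence

def rasch_prob(ability: float, difficulty: float) -> float:
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def between_fit_mean_square(
    responses: Sequence[int],
    abilities: Sequence[float],
    groups: Sequence[str],
    difficulty: float,
) -> float:
    """Between-group mean square for a single item.

    For each group: (observed correct - expected correct)^2 / model variance,
    summed over groups and divided by (number of groups - 1).
    """
    observed: Dict[str, float] = {}
    expected: Dict[str, float] = {}
    variance: Dict[str, float] = {}
    for x, b, g in zip(responses, abilities, groups):
        p = rasch_prob(b, difficulty)
        observed[g] = observed.get(g, 0.0) + x
        expected[g] = expected.get(g, 0.0) + p
        variance[g] = variance.get(g, 0.0) + p * (1.0 - p)
    if len(observed) < 2:
        raise ValueError("at least two groups are required")
    total = sum(
        (observed[g] - expected[g]) ** 2 / variance[g] for g in observed
    )
    return total / (len(observed) - 1)

# Hypothetical toy data: four persons whose ability equals the item difficulty.
abilities = [0.0, 0.0, 0.0, 0.0]
groups = ["A", "A", "B", "B"]

# Each group gets one item right, exactly as the model predicts.
no_bias = between_fit_mean_square([1, 0, 1, 0], abilities, groups, 0.0)

# Group A gets everything right and group B nothing: a large discrepancy.
strong_bias = between_fit_mean_square([1, 1, 0, 0], abilities, groups, 0.0)
print(no_bias, strong_bias)
```

A value near zero means the subgroups perform as the pooled calibration predicts, while large values signal that the item functions differently across groups.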

Several points are in order here. All items were used in the IPARM 
analysis regardless of the initial item fit statistics generated by BIGSTEPS. All 
persons were included in the IPARM analysis regardless of the initial person fit 
statistics generated by BIGSTEPS. The rationale here was to have both the 
overall misfitting items and those items showing potential bias available 
simultaneously. Gender was known to be a variable without bias since the test
had been examined for this before. It was included as a variable for comparison
with an unknown factor, culture.

This was the first known Rasch analysis of The Lollipop Test. As such, the
subtest scores were included as an analysis in addition to the test as a whole. This
test has several clearly different types of items related to preschool readiness,
for example, recognition of numbers, letters and spatial relations of objects. For
this analysis, the test was simply broken down into the subtests. It is of course
possible that other item classifications (such as items which require pointing to
respond, telling to respond, and drawing to respond) would make sense, but there
is no reason at the moment to second-guess the groupings the test author has designated.
The fit of items to meet the requirements of the Rasch model was the intent here.
For all fit statistics, 2.0 (95% confidence level) was considered the criterion for
examination.

RESULTS & DISCUSSION 



Initial Calibration Results 

The results of the BIGSTEPS analysis for The Lollipop Test are summarized in Table
1. The item/person map reveals that our sample was particularly strong for the
range of the test. All 58 items entered the analysis with 58 persons. Three
persons were dropped. Two persons obtained perfect scores while one person
was judged a misfit with a 2.07 infit and a 3.60 outfit. There is the possibility
that one person of 61 was simply a Type I error. An examination of that person's
report revealed a generally weak ability (-1.12) with unusually high scores in one
particular subtest (a spontaneous response to a picture). The score seemed to
reflect selective knowledge in that area, with the obvious likelihood that this
preschooler had been exposed to the subtest material in some comparatively
enriched way or was shy and only decided to respond to the area of questions
favored.
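
The infit and outfit values mentioned above are built from residuals between observed and model-expected responses. A minimal Python sketch of the underlying mean squares for one person follows. It assumes the usual definitions (outfit as the mean of squared standardized residuals, infit as the information-weighted version) and stops at the mean squares; the standardization that produces values judged against 2.0 is not shown. The item difficulties and responses are hypothetical.

```python
import math
from typing import Sequence, Tuple

def rasch_prob(ability: float, difficulty: float) -> float:
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def person_fit_mean_squares(
    responses: Sequence[int],
    ability: float,
    difficulties: Sequence[float],
) -> Tuple[float, float]:
    """Return (infit, outfit) mean squares for one person."""
    squared_residuals = 0.0   # sum of (x - p)^2
    information = 0.0         # sum of p * (1 - p)
    squared_z = 0.0           # sum of standardized squared residuals
    for x, d in zip(responses, difficulties):
        p = rasch_prob(ability, d)
        w = p * (1.0 - p)
        r2 = (x - p) ** 2
        squared_residuals += r2
        information += w
        squared_z += r2 / w
    infit = squared_residuals / information
    outfit = squared_z / len(responses)
    return infit, outfit

# Hypothetical items ordered from easy to hard.
difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]

# An expected pattern for a low-ability person: only the easiest item correct.
expected_pattern = person_fit_mean_squares([1, 0, 0, 0, 0], -1.12, difficulties)

# A surprising pattern: the same person succeeds only on the two hardest items.
surprising_pattern = person_fit_mean_squares([0, 0, 0, 1, 1], -1.12, difficulties)
print(expected_pattern, surprising_pattern)
```

Unexpected successes on items far above a person's measure inflate outfit more than infit, which is consistent with a low-ability child scoring unusually well on one subtest.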

Ten of the 58 items (8, 10, 14, 18, 19, 21, 22, 23, 25, 28) had both infit and
outfit greater than 2.0. Three of these items were from subtest 1 and were the
more difficult of the shape recognition and drawing tasks. Five items were
from subtest 2 and dealt with the more difficult spatial relations (left, underneath,
first, last). Two items were from subtest 3 and involved the recognition of
numbers. No items from subtest 4 were revealed as potential problems. Items 8,
18, 19, 21, and 22 were among the most difficult on the test (1.43, 1.16, 2.25,
1.29, and 2.25 respectively). Since the children had opportunity to "guess" by
pointing randomly at the stimulus card, it is quite likely the disturbance was due
to guessing. Item 14 involved drawing a square which was dependent on motor
coordination and scoring effects. Items 21, 23, 25, and 28 were subject to
guessing, but not as difficult. Item results are given in Table 2.

Item and Person Analyses

Two of the 58 items (5 & 28) revealed a between fit statistic greater than
2.0 for sex differences. Both show an advantage to female students. Item 5 deals
with color recognition (brown) while item 28 is number recognition ("9").
Even though it is possible to imagine color recognition being gender related,
there is no explanation for item 28 except to state again that the item seems to be
subject to guessing as revealed in the first analysis. These item profiles are
shown in Table 3. The overall gender between fit statistic was .01, indicating
virtually no bias. It is possible that one or two items are simply Type I errors.
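
The suggestion that one or two flagged items may be Type I errors can be checked with simple arithmetic. The Python sketch below assumes each of the 58 item tests has a false-alarm rate of .05 (following the "95% confidence level" description of the 2.0 criterion) and that the tests are independent. Both are simplifying assumptions made for illustration.

```python
def expected_false_flags(n_tests: int, alpha: float) -> float:
    """Expected number of items flagged by chance alone."""
    return n_tests * alpha

def prob_at_least_one_false_flag(n_tests: int, alpha: float) -> float:
    """Chance that at least one item is flagged when no bias exists."""
    return 1.0 - (1.0 - alpha) ** n_tests

n_items = 58   # dichotomous items entered in the analysis
alpha = 0.05   # assumed per-item false-alarm rate

expected = expected_false_flags(n_items, alpha)
at_least_one = prob_at_least_one_false_flag(n_items, alpha)
print(expected, round(at_least_one, 3))
```

Under these assumptions roughly three chance flags would be expected per demographic variable, so two flagged items for sex are well within what chance alone could produce.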

It is interesting to note that item 5 was spontaneously singled out by our
interpreter as a possible problem for Hispanics, since the translation of "brown"
was not what she suggested and some children gave the answer "chocolate."
There was no culture bias revealed for this item, as it had a between fit of -.10.

Three items appeared to show culture bias with a between fit greater than
2.0 (9, 19, & 25). Again, two of these (19 & 25) had already been identified as
problems in the initial calibration. In all three cases the bias is in favor of
Hispanic students, with differences between the predicted and observed
proportions of .223, .284, and .264 respectively. The item profiles are
summarized in Table 4. The overall between fit for culture was .24, and no items
met the criterion of 2.0 for bias involving black students.
The worst item in this regard (11) had a between fit of 1.84, and the
black student difference was -.155.
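
The differences between predicted and observed proportions quoted above can be understood as, for each subgroup, the observed proportion correct on an item minus the average Rasch probability for the persons in that subgroup. The Python sketch below illustrates that calculation. The function name, sign convention (observed minus predicted) and toy data are assumptions for illustration and may not match IPARM's output format.

```python
import math
from typing import Dict, Sequence, Tuple

def rasch_prob(ability: float, difficulty: float) -> float:
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def observed_minus_predicted(
    responses: Sequence[int],
    abilities: Sequence[float],
    groups: Sequence[str],
    difficulty: float,
) -> Dict[str, float]:
    """Observed minus predicted proportion correct for each group on one item."""
    totals: Dict[str, Tuple[int, float, float]] = {}
    for x, b, g in zip(responses, abilities, groups):
        n, obs, pred = totals.get(g, (0, 0.0, 0.0))
        totals[g] = (n + 1, obs + x, pred + rasch_prob(b, difficulty))
    return {g: (obs - pred) / n for g, (n, obs, pred) in totals.items()}

# Hypothetical toy data: every person's ability equals the item difficulty,
# so the model predicts a proportion of .50 in each group.
responses = [1, 1, 1, 0, 0, 1]
abilities = [0.0] * 6
groups = ["H", "H", "H", "H", "W", "W"]

diffs = observed_minus_predicted(responses, abilities, groups, 0.0)
print(diffs)
```

A positive value means the group did better on the item than the pooled calibration predicts, which is the direction reported for Hispanic students on items 9, 19, and 25.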

Sixteen of the 61 persons had an unweighted total fit or a weighted fit or a 
subtest between fit greater than 2.0. Since the subtests differed in content that 
might be taught simultaneously at school (such as letters and numbers or shapes 
and colors), but might not be emphasized with equal experience at home, it is 
quite likely that these preschoolers are subject to the whims of parental values and 
the lack of fit is somewhat expected at this early age. 

Conclusions

There does not appear to be any gender or culture bias for The Lollipop
Test as a whole. Even the items which exceed the between fit criterion of 2.0
are mostly a subset of the items that misfit on the whole. It is
interesting to note that all of the potentially biased items are in favor of female or
Hispanic students. It seems that the author of the test has little to worry about
with regard to underscoring traditional minorities.

One hypothesis that was suggested above is that the items showing misfit 
were a result of the child guessing by randomly pointing to the answer on the 
stimulus card. Since some figures on a card are already used in earlier items by 
the time the child gets to the "misfitting" item, the child is guessing among the 
figures left over. Smith (1991) suggests that the power of the between fit statistic
is useful in detecting systematic bias, such as cultural or gender bias, while the
unweighted total fit statistic is more sensitive to random disturbances such as
guessing. In examining the total unweighted fit statistics of the five potentially
biased items, two (19 and 28) revealed evidence of possible guessing while three
did not result in values greater than 2.0. In fact, when one looks at the total 
unweighted fit for all ten potential misfitting items, the contrast between those 
with high total unweighted fit and those with high between fit from the item and
person analysis is evident. These figures are summarized in Table 5.

Based on Table 5, one might suggest that items 8, 19, 22, 23, and 28 seem
to reflect more random disturbance, and any race or gender bias might be related to
that characteristic, be it guessing or some other factor. On the other hand, item 5
seems to reflect systematic gender bias while items 9 and 25 might reflect
systematic culture bias. Since 25 is a very difficult item and it shows an
unweighted total fit of 1.14, there is a possibility that it reflects random disturbance with
a ceiling effect.

Regardless of the final determination of the individual items, it seems
appropriate to conclude with a statement about the utility of this type of analysis.
Since this is a relatively small sample and a large number of statistics are utilized,
the potential for some Type I error is great. Fortunately, that is acceptable when one
considers that the likely result of such an error is that a test will not be used
inappropriately.

One problem is that the small sample used is likely to be a power problem
where the analysis is less likely to detect bias. The choice of the between fit
statistic was an attempt to offset that possibility, but there is only so much that can
be done with subgroups of 12, 24, and 25. Quite likely more items would have
been flagged as the sample size increased, but at a certain point the power
would have become so great that proportional differences which are practically
meaningless become significant. The test examiner must find some balance here.
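
The trade-off between sample size and power can be illustrated with a rough calculation. The Python sketch below holds a proportion difference fixed at .155 (the magnitude reported for item 11) and shows how a simple binomial z statistic grows with subgroup size. This is a deliberately simplified approximation, not the between fit statistic, and the assumed model proportion of .50 and the larger sample size are hypothetical.

```python
import math

def approx_z(difference: float, model_proportion: float, n: int) -> float:
    """Approximate z for an observed-minus-predicted proportion difference,
    using the binomial standard error sqrt(p * (1 - p) / n)."""
    standard_error = math.sqrt(model_proportion * (1.0 - model_proportion) / n)
    return difference / standard_error

difference = 0.155       # magnitude reported for item 11 (black students)
model_proportion = 0.5   # hypothetical predicted proportion correct

z_small = approx_z(difference, model_proportion, 24)    # subgroup size in this study
z_large = approx_z(difference, model_proportion, 240)   # hypothetical tenfold sample
print(round(z_small, 2), round(z_large, 2))
```

Under these assumptions the same difference falls below 2.0 with 24 children and well above it with 240, which is the balance between missed bias and trivial significance described above.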

Finally, the IPARM software is a welcome addition for those who want to
quickly and easily get results that send them back to the test with a serious intent
to revise and edit. The results reveal many clues to item functioning in an easily
understood format.

Table 1

Summary of BIGSTEPS Rasch Analysis of The Lollipop Test
Map of Persons and Items

SUMMARY OF MEASURED STEPS

Table 1

Summary of BIGSTEPS Analysis of The Lollipop Test
Item Statistics and Map

Summary of IPARM Analysis of The Lollipop Test
Items Identified as Potentially Gender Biased

Summary of IPARM Analysis of The Lollipop Test
Items Identified as Potentially Culture Biased

TABLE 5

Summary of Fit Statistics for Selected Items

Items suggested by calibration:

Total Unweighted Fit    Between Fit Sex    Between Fit Culture

Seventh International Objective Measurement Workshop



Updated Information Sheet 



Time: Saturday, April 10 to Sunday, April 11 

Place: 206 White Hall 
Emory University 
Atlanta, GA 30322 
USA 

Co-Chairs: George Engelhard, Jr. , Emory University and Judy Monsaas, West Georgia College 
Phone: George Engelhard (404-727-0607 at office and 404-525-1115 at home) 



Saturday Night Dinner: Dinner will be at Jagger's Restaurant (Number 121 on your 
map) at 6:00 p.m. 



Van Schedule: A van will leave from Emory Inn to White Hall at 8:00 and 8:15 a.m. on
Saturday and Sunday.



Schedule Changes: 

Pender Pedler will not present a paper during Session 6 (11:00-12:00) 

Ben Wright will present a paper entitled "The significance of divisibility in Rasch
measurement" during Session 5 (8:30-10:30) in place of the last two presentations
("Facets as ANOVA" and "Measuring with unexpected relevant observations")



[Map on back] 






